Please reach out to the email address given above if you encounter any problems while running the script.
You are consulting for a real estate company whose business model centres on purchasing properties in New York City to rent out short-term. The company has already concluded that two-bedroom properties are the most profitable; however, it does not know which zip codes are the best to invest in.
Identify zip codes that would generate the most profit on short term rentals within New York City.
Two-bedroom properties are the most profitable (as identified by the client).
An occupancy rate of 75% has been assumed, and all occupancies are assumed to be one-night stays.
The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
The median of monthly median property prices for a RegionName (zipcode) will correctly reflect the median market cost of property in a zip code
Property price data for 2017 covers only six months, so the median over those six months is taken as the 2017 median price.
CAGR (Compound Annual Growth Rate) is an appropriate indicator for forecasting property and rent prices.
2020 median property costs are forecast by applying the 3-year CAGR to the 2017 median property costs.
Since all properties and all square feet within each locale are assumed to be homogeneous, each zip code in a neighbourhood group takes the 2020 median price as its median property price. The per-square-foot property cost for a neighbourhood is assumed to be the 2020 median price divided by the median square footage of properties in that neighbourhood.
Rent per night (price) for 2020 is assumed to grow at the same rate as property values (3-year CAGR) in that zip code from 2017 to 2020. If the CAGR is negative, rent is assumed to remain unchanged.
Since the break-even period spans many years, the 5-year CAGR (calculated from the 2012–2017 change in property values in the Zillow file) is used as the rate at which rent increases in the years after 2020. Rent per night is charged for all occupied days, and 75% occupancy per year is assumed for all neighbourhoods and zip codes.
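Under these assumptions, annual revenue per listing collapses to a single product; a minimal sketch with a hypothetical $200/night listing:

```python
# Hypothetical illustration of the stated assumptions: 75% occupancy,
# one-night stays, nightly rent charged for every occupied night.
OCCUPANCY = 0.75
DAYS_PER_YEAR = 365

def annual_rent(price_per_night: float) -> float:
    """Annual revenue for a listing under the 75%-occupancy assumption."""
    return price_per_night * OCCUPANCY * DAYS_PER_YEAR

# A $200/night listing grosses 200 * 0.75 * 365 = $54,750 per year
print(annual_rent(200))
```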
!pip install handcalcs
!pip install "ipywidgets>=7.0.0"
!pip install folium
!pip install plotly_express==0.4.0
#Custom preprocessing modules
from preprocessing import *
from kpis import *
#Python dependencies
import gzip
import re
import math
import urllib.request, json
import warnings
warnings.filterwarnings('ignore')
#Data processing libraries
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 150)
import numpy as np
#Data Visualization libraries
import seaborn as sns
from ipywidgets import widgets
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import handcalcs.render #To render Metrics equations in the form of handwritten calculations
#Imports for ML model
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
with gzip.open('listings.csv.gz') as f:
    airbnb = pd.read_csv(f)
#Keeping only important features
subset = ['last_scraped','neighbourhood_group_cleansed','city', 'state', 'zipcode', 'is_location_exact',\
'market','latitude', 'longitude', 'property_type', 'room_type','accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'square_feet',\
'price', 'cleaning_fee','extra_people', 'availability_30', 'availability_60', 'availability_90',
'availability_365', 'number_of_reviews']
#Copy to avoid modifying a view of the original dataframe
ab = airbnb[subset].copy()
#Dictionary to store all variants of the New York name that will later be used to map appropriate names
di = {"Ny": "NY", "ny": "NY", "New York ": "NY"}
ab.replace({"state" : di}, inplace = True)
ab = ab[ab.state == 'NY']
ab = city_cleaner(ab, 'city')
# Selecting only 2 bedroom properties in New York City
ab = ab[(ab.bedrooms==2.0) & (ab.state=="NY") & ((ab.city == "New York City") | (ab.market== "New York"))]
from preprocessing import numeric_cleaner
#Stored currency related columns in an array
currency = ['price', 'extra_people', 'cleaning_fee']
#Cleaning all currency related column
ab = numeric_cleaner(ab, currency)
#One row has price per night for 0.0 which is a data discrepancy
#Used Boolean masking to drop rows where price is 0
mask = ab[ab.price==0]
ab.drop(mask.index, inplace = True)
ab['price'].describe(percentiles=[0.25,0.50,0.75,0.95,0.975])
px.box(ab, y = 'price', width = 450., title = 'Variation in Rent with Outliers' )
From the statistics and the boxplot above, it is clear that prices for a few of the properties are far higher than the rest. We shall remove such properties from our dataset.
#Removing outliers (2.5% of data)
ab = ab[ab['price']<600]
px.box(ab, y = 'price', width = 450., title = 'Variation in Rent after removing Outliers' )
After removing the outliers in the price column, the box plot now describes some acceptable statistics.
# Checking how many variants of zipcode are present, counted by string length
from collections import defaultdict
d = defaultdict(int)
for i in ab['zipcode']:
    d[len(str(i))] += 1
d
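The same tally can be done directly in pandas; a sketch on a toy Series (the notebook's `ab` frame is assumed, so toy data stands in here):

```python
import pandas as pd

# Toy zipcode column mimicking the mixed formats seen in the data
zips = pd.Series(['10025', '11201', '10025-1234', None, '10036'])

# Length of each value's string form, tallied -- equivalent to the
# defaultdict loop above (None becomes the 4-character string 'None')
length_counts = zips.astype(str).str.len().value_counts()
print(length_counts)
```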
#Removed rows where the zipcode is not available. Tried filling null zipcodes using the third-party
#package uszipcode, but manual inspection showed its values to be inaccurate compared with the dataset
ab = ab[ab['zipcode'].isna()==False]
ab = ab[(ab.zipcode.str.len() == 5)]
#We need to fill missing values in the cleaning_fee column based on neighbourhood_group_cleansed and property_type
compare_stats = ab.groupby(['neighbourhood_group_cleansed', 'property_type'])['cleaning_fee'].describe(percentiles = [0.25,0.50,0.75,0.95])
compare_stats
compare_stats = compare_stats.reset_index()
compare_stats_bronx = compare_stats[compare_stats['neighbourhood_group_cleansed'] == 'Bronx']
compare_stats_manhattan = compare_stats[compare_stats['neighbourhood_group_cleansed'] == 'Manhattan']
compare_stats_Queens = compare_stats[compare_stats['neighbourhood_group_cleansed'] == 'Queens']
compare_stats_StatenIsland = compare_stats[compare_stats['neighbourhood_group_cleansed'] == 'Staten Island']
from plotly.subplots import make_subplots
neighborhoods = [compare_stats_bronx, compare_stats_manhattan, compare_stats_Queens, compare_stats_StatenIsland]
name = ['Bronx', 'Manhattan', 'Queens', 'Staten Island']
fig = make_subplots(rows=2, cols=2, vertical_spacing = 0.30, horizontal_spacing = 0.20, row_heights =[800, 800],\
subplot_titles = ['{} Cleaning fee trend (95%ile and max value)'.format(title) for title in name],
)
r = 1
c = 1
for index, df in enumerate(neighborhoods):
    fig.add_trace(go.Scatter(x=df['property_type'], y=df['95%']), row=r, col=c)
    fig.add_trace(go.Scatter(x=df['property_type'], y=df['max']), row=r, col=c)
    #Move to the next subplot position: (1,1) -> (1,2) -> (2,1) -> (2,2)
    if r == 1 and c == 1:
        c += 1
    elif r == 1 and c == 2:
        r, c = 2, 1
    elif r == 2 and c == 1:
        c += 1
fig.update_layout(height=800)
fig.show()
The four plots show the 95th-percentile and maximum cleaning fees for the Bronx, Manhattan, Queens and Staten Island neighborhoods. The cleaning_fee column contains 847 missing values, and it was important to look at these statistics to decide whether to fill the null values with the mean or the median cleaning fee of the corresponding neighborhood and property type. We can see stark differences for the following observations:
#Filling null values in Cleaning_fee column by median cleaning_Fee as per neighborhood and property type
ab["cleaning_fee"] = ab.groupby(['neighbourhood_group_cleansed', 'property_type'])['cleaning_fee'].transform(lambda x: x.fillna(x.median()))
ab = ab[ab['cleaning_fee'].notna()]
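The `groupby(...).transform` idiom above fills each missing fee with the median of its own (neighbourhood, property type) group; a toy illustration with shortened, hypothetical column names:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'hood': ['Bronx', 'Bronx', 'Bronx', 'Queens', 'Queens'],
    'fee':  [10.0,    30.0,    np.nan,  50.0,     np.nan],
})

# Each NaN is replaced by the median fee of its own neighbourhood group
toy['fee'] = toy.groupby('hood')['fee'].transform(lambda x: x.fillna(x.median()))
print(toy['fee'].tolist())  # Bronx NaN -> 20.0, Queens NaN -> 50.0
```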
There are only 97 observations with a square-feet value. We cannot expect a highly accurate model trained on so few values, but as a proof of concept we will build one, since this column plays an important role in calculating property costs. We are told to assume that property cost and square footage are homogeneous within a given neighbourhood.
Hence we will build a model to predict the square_feet of each property.
To determine which features correlate best with the square_feet column, let's do some analysis.
#Calculating median rent for each neighbourhood
median_rent = ab.groupby(['neighbourhood_group_cleansed'])[['price']].agg('median')
median_rent.columns = ['median_rent']
fig = px.bar(median_rent, x=median_rent.index , y = 'median_rent',\
text = median_rent['median_rent'], title = 'Median rent across Neighborhoods',height = 600, orientation = 'v', color = median_rent.index )
fig
We see that Manhattan is the most expensive neighborhood in which to rent property, followed by Brooklyn and Queens.
Now, let's check trend for Rent and cleaning fee with Square feet.
fig = px.scatter(ab, x= "square_feet", y='price', color= "neighbourhood_group_cleansed", trendline = 'ols',\
labels = {"square_feet": "Square feet", "price": "Rent"}, title = 'Square feet vs Rent',\
height = 750 )
fig.update_traces(marker=dict(size=12),
selector=dict(mode='markers'))
fig
From the above scatter plot, it can be inferred that for all neighborhood groups, rent increases with property size. The trendline for the Queens neighborhood does not show a strong linear relationship because of the small number of data points.
fig = px.scatter(ab, x= "square_feet", y='cleaning_fee', color= "neighbourhood_group_cleansed", trendline = 'ols',\
labels = {"square_feet": "Square feet", "cleaning_fee": "Cleaning fee"}, title = 'Square feet vs Cleaning fee',\
height = 750 )
fig.update_traces(marker=dict(size=12),
selector=dict(mode='markers'))
fig
The above plot shows how the cleaning fee varies with property size (square feet). From the 97 data points we currently have for square feet, there does not appear to be a strong relationship between the two.
fig = px.histogram(ab, x="square_feet", color="room_type",
title = 'Distribution of properties by room_type and square_feet')
fig
Now, let's plot a correlation heatmap to identify the features that correlate with square_feet.
#Get all continuous variables
for_heatmap = ab[['square_feet', 'price', 'cleaning_fee',
'extra_people', 'number_of_reviews']]
# Increase the size of the heatmap.
# Store heatmap object in a variable to easily access it when you want to include more features (such as title).
# Setting the range of values to be displayed on the colormap from -1 to 1, and setting the annotation to True to display the correlation values on the heatmap.
heatmap = sns.heatmap(for_heatmap.corr(), vmin=-1, vmax=1, annot=True)
# Giving a title to the heatmap. Pad defines the distance of the title from the top of the heatmap.
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
We see that only the price feature shows some correlation with square_feet.
#Separating out the features that seem useful for predicting square_feet
ab_for_ml = ab[['zipcode', 'accommodates', 'bathrooms',
'bedrooms', 'beds', 'price', 'cleaning_fee','neighbourhood_group_cleansed','room_type', 'square_feet']]
#filtering out null values from square feet column so that the dataset thus curated can be used to train ML model
ab_known = ab_for_ml[ab_for_ml.square_feet.isna()==False]
#X will be the predictor columns
X = ab_known[['zipcode','neighbourhood_group_cleansed','room_type','accommodates', 'bathrooms',
'bedrooms', 'beds', 'price', 'cleaning_fee']]
#We will need to process categorical and numerical features separately before passing it to ML model. Hence creating
#a separate list of categorical and numerical features
#col_cat is a list of Categorical features
col_cat = ['zipcode','neighbourhood_group_cleansed','room_type']
#col_num is a list of Numerical features
col_num = ['accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'cleaning_fee']
#X_num is the dataframe with numerical features
X_num = ab_known[col_num]
#X_cat is the dataframe with categorical features
X_cat = ab_known[col_cat]
#y is the target label
y = ab_known[['square_feet']]
#Splitting the dataset (where square_feet values are known) into training and testing sets.
#Training will comprise 80% of the data and testing the remaining 20%.
X__train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
def MSE(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    print(f'MSE: {mse}')
    return mse
def R2(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    print(f'R2: {r2}')
    return r2
def two_score(y_true, y_pred):
    MSE(y_true, y_pred) #set score here and not below if using MSE in GridCV
    score = R2(y_true, y_pred)
    return score
def two_scorer():
    return make_scorer(two_score, greater_is_better=True)
pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
pipe_num = make_pipeline(StandardScaler(), SimpleImputer())
preprocessor = make_column_transformer((pipe_cat, col_cat), (pipe_num, col_num))
pipe = Pipeline(steps = [('preprocessor',preprocessor),
('lasso', Lasso(max_iter=10000))
])
param_grid = {
'lasso__alpha':[0.03, 0.3, 3]}
lasso_grid = GridSearchCV(estimator = pipe, param_grid=param_grid, scoring = two_scorer(), cv=10, n_jobs=-1, return_train_score=True)
lasso_grid.fit(X__train,y_train)
print(lasso_grid.best_estimator_)
mean_squared_error(y_train, lasso_grid.predict(X__train))
lasso_grid.score(X__train, y_train)
print("Training set score: {:.2f}".format(lasso_grid.score(X__train, y_train)))
print("Test set score: {:.2f}".format(lasso_grid.score(X_test, y_test)))
# print("Number of features used:", np.sum(lasso_grid.coef_ != 0))
# Creating the dataset to predict missing square_feet
sqfeet_predict = ab[ab.square_feet.isna()==True]
sqfeet_predict = sqfeet_predict[['zipcode','neighbourhood_group_cleansed','room_type','accommodates', 'bathrooms',
'bedrooms', 'beds', 'price', 'cleaning_fee', 'square_feet']]
X_pred = sqfeet_predict[['zipcode','neighbourhood_group_cleansed','room_type','accommodates', 'bathrooms',
'bedrooms', 'beds', 'price', 'cleaning_fee']]
col_cat = ['zipcode','neighbourhood_group_cleansed','room_type']
col_num = ['accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'cleaning_fee']
X_num_pred = sqfeet_predict[col_num]
X_cat_pred = sqfeet_predict[col_cat]
y_pred = sqfeet_predict[['square_feet']]
#Predicting missing square_feet values by passing predictor columns to the lasso model
y_predicted = lasso_grid.predict(X_pred)
#Assigning the predicted values to the dataset
sqfeet_predict['square_feet'] = y_predicted
#Merging the dataframe containing predicted values for square_feet with the original dataframe containing missing values
ab_after_prediction = pd.merge(ab, sqfeet_predict, how = 'inner', left_on = ab.index ,right_on = sqfeet_predict.index)
ab_after_prediction.head()
ab_after_prediction.drop(columns = ['zipcode_y', 'neighbourhood_group_cleansed_y', 'room_type_y',
'accommodates_y', 'bathrooms_y', 'bedrooms_y', 'beds_y', 'price_y',
'cleaning_fee_y'], inplace = True)
#Replacing null values in the square_feet column with values from the predicted square_feet column
ab_after_prediction['square_feet_x'].fillna(ab_after_prediction['square_feet_y'], inplace = True)
ab_after_prediction['square_feet_x'].describe()
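`Series.fillna` accepts another Series and fills element-wise by index alignment, which is what merges the observed and predicted square footage above; a toy sketch with hypothetical values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'square_feet_x': [800.0, np.nan, 650.0, np.nan],  # observed values
    'square_feet_y': [np.nan, 720.0, np.nan, 910.0],  # model predictions
})

# Missing observed values are replaced by the aligned predicted values
df['square_feet_x'] = df['square_feet_x'].fillna(df['square_feet_y'])
print(df['square_feet_x'].tolist())
```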
#Square_feet cannot be negative. Counting such rows (24), which we will drop from the dataset
(ab_after_prediction[ab_after_prediction['square_feet_x']<0]).shape[0]
#Filtered dataset with only positive values for square_feet
ab_after_prediction = ab_after_prediction[ab_after_prediction['square_feet_x']>0]
#Statistics for the square_feet column to check whether we have any outliers in the dataset
ab_after_prediction['square_feet_x'].describe(percentiles = [0.05,0.1,0.25,0.50,0.75,0.85,0.95,0.99])
fig = px.scatter(ab_after_prediction, x= "square_feet_x", y='price_x', color= "neighbourhood_group_cleansed_x", trendline = 'ols',\
labels = {"square_feet_X": "Square feet", "price_x": "Rent"}, title = 'Square feet vs Rent',\
height = 750 )
fig.update_traces(marker=dict(size=5),
selector=dict(mode='markers')
)
fig
The above graph shows a linear relationship between square_feet (including predicted values) and rent for all the neighborhoods except Staten Island.
ab_after_prediction.drop(columns = ['key_0','is_location_exact', 'square_feet_y'], inplace = True)
#Renaming columns
ab_after_prediction.columns = ['last_scraped', 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode',
'market', 'latitude', 'longitude', 'property_type',
'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds',
'bed_type', 'square_feet', 'price', 'cleaning_fee',
'extra_people', 'availability_30', 'availability_60', 'availability_90',
'availability_365', 'number_of_reviews']
ab_after_prediction.dropna(inplace = True)
# ab_after_prediction.to_csv('ab_after_prediction.csv')
# ab_after_prediction = pd.read_csv('ab_after_prediction.csv')
ab_after_prediction['zipcode'] = ab_after_prediction['zipcode'].astype('str')
zillow = pd.read_excel('(DC 4) Zip_Zhvi_2bedroom (1).xlsx')
#We see that a few of the columns have their names in datetime format. We need to change them to strings.
#Let's first extract those datetime columns
date_cols = [col for index, col in enumerate(list(zillow.columns)) if index >6]
#Converting them to string "yyyy-mm" format
date_cols_converted = [col.strftime('%Y-%m')for col in date_cols]
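The datetime-to-string rename can be checked in isolation with the standard library:

```python
from datetime import datetime

# Zillow's monthly columns arrive as datetime objects; we rename them
# to compact 'YYYY-MM' strings so they behave as ordinary column labels
date_cols = [datetime(2017, 1, 1), datetime(2017, 2, 1)]
converted = [col.strftime('%Y-%m') for col in date_cols]
print(converted)  # ['2017-01', '2017-02']
```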
#Creating a new list of modified column names
zillow_cols = list(zillow.columns)[:7] + date_cols_converted
#Assigning new column names to original data
zillow.columns = zillow_cols
#Keep only those rows where City is New york and State is New York
zillow = zillow[(zillow['City']=='New York') & (zillow['State'] == 'NY')]
#Heatmap to check missing patterns in the data
cmap = sns.cubehelix_palette(light=1, as_cmap=True, reverse=True)
sns.heatmap(zillow.isnull(), cmap = cmap)
For the monthly median property costs, no values are missing after 2007. Hence we will drop the data from 1996 to 2006.
#unpivoting the dataframe (Converting from wide to long format)
zillow = zillow.melt(id_vars=['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName',
'SizeRank'], var_name='date', value_name='price')
#splitting the date column into Year and Month
month_and_year = zillow['date'].str.split('-',expand = True)
#create separate column for year
zillow['year'] = month_and_year[0]
#create separate column for month
zillow['month'] = month_and_year[1]
zillow.isna().sum()
#Drop all rows before 2007
zillow = zillow[zillow.year > '2006']
#Yearly median price by Zipcode
zillow = zillow.groupby(['RegionName','CountyName','year'])[['price']].agg('median')
zillow = zillow.reset_index()
zillow.head()
zillow = zillow.pivot(index = ['RegionName', 'CountyName'] , columns = ['year'], values = 'price')
zillow = zillow.reset_index()
zillow.head()
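The wide → long → yearly-median → wide round trip above can be sketched on a toy two-zipcode frame (hypothetical prices):

```python
import pandas as pd

wide = pd.DataFrame({
    'RegionName': [10025, 11201],
    '2016-01': [100.0, 200.0],
    '2016-02': [110.0, 220.0],
})

# Wide -> long: one row per (zipcode, month)
long_df = wide.melt(id_vars=['RegionName'], var_name='date', value_name='price')
long_df['year'] = long_df['date'].str.split('-', expand=True)[0]

# Yearly median per zipcode, then long -> wide with years as columns
yearly = long_df.groupby(['RegionName', 'year'])['price'].median().reset_index()
back = yearly.pivot(index='RegionName', columns='year', values='price').reset_index()
print(back)
```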
%%render
medianPrice_2012 = 1000000.0 #Consider the median price for properties in a zipcode for the year 2012 to be 1M USD
medianPrice_2017 = 2000000.0 #Consider the median price for properties in the same zipcode for the year 2017 to be 2M USD
n = 5
CAGR = (((medianPrice_2017/medianPrice_2012)**(1/n))-1)*100
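The same calculation as a plain Python helper (the `rate` function used below comes from the custom `kpis` module, so this stand-in only mirrors the formula above):

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate, in percent."""
    return ((end_value / start_value) ** (1 / years) - 1) * 100

# Doubling from $1M to $2M over 5 years is ~14.87% per year
print(round(cagr(1_000_000, 2_000_000, 5), 2))
```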
#Calculating CAGR over 10, 5, 2 and 3 years respectively
zillow['cagr_10'] = rate(zillow, '2007', '2017')
zillow['cagr_5'] = rate(zillow, '2012', '2017')
zillow['cagr_2'] = rate(zillow, '2015', '2017')
zillow['cagr_3'] = rate(zillow, '2014', '2017')
zillow['RegionName'] = zillow['RegionName'].astype('str')
zillow.head()
#Merging Airbnb and Zillow data
df_merged = pd.merge(ab_after_prediction, zillow, how = 'inner', left_on = 'zipcode', right_on = 'RegionName')
# df_merged.drop('key_0', axis =1 ,inplace = True)
%%render
MedianCost_2017 = 1000000
CAGR_3year = 5.0
MedianCost_2020 = (MedianCost_2017)*((1+(CAGR_3year/100))**2)
#Calculate 2020 median cost of properties by forecasting
df_merged['2020'] = round((df_merged['2017']*((1+(df_merged.cagr_3/100))**2)),1)
# df_merged['per_sq_feet'] = df_merged['2020']/df_merged['square_feet_x']
zipcode_median_square_feet = df_merged.groupby(['zipcode'])[['square_feet']].agg('median')
zipcode_median_square_feet = zipcode_median_square_feet.reset_index()
zipcode_median_square_feet.columns = ['zipcode', 'median_sq_feet']
df_merged = pd.merge(df_merged, zipcode_median_square_feet, how = 'left', left_on = 'zipcode',\
right_on = 'zipcode')
%%render
medianPrice_2020_forZipcodexxxxx = 1000000.0 #Median price for properties in a particular zipcode for the year 2020
medianSquareFeet_forZipcodexxxxx = 1200.0 #Median Square feet for all properties in a particular Zipcode. Obtained from Airbnb dataset
CostPerSquareFoot_forZipcodexxxxx = (medianPrice_2020_forZipcodexxxxx)/(medianSquareFeet_forZipcodexxxxx) #This will be the cost per square foot for a particular Zip Code
# df_merged['cost_per_sq_foot'] = round((df_merged['2020']/df_merged['median_sq_feet']),1)
df_merged = cost_per_foot(df_merged)
%%render
CostPerSquareFoot_forZipcodexxxxx = 833.33
SquareFootofProperty_inZipcodexxxxx = 1000
CostOfProperty_2020 = CostPerSquareFoot_forZipcodexxxxx * SquareFootofProperty_inZipcodexxxxx
# df_merged['cost_2020'] = round((df_merged['cost_per_sq_foot']*df_merged['square_feet']), 1)
df_merged = cost_each_property(df_merged)
%%render
CostOfProperty_2020 = 1000000
CAGR_10yrs = 3.550
RoiEquity_10yrs = ((CostOfProperty_2020*(1+(CAGR_10yrs/100))**10) - CostOfProperty_2020)/CostOfProperty_2020
# df_merged['roi_equity'] = ((df_merged['cost_2020'] * ((1 +(df_merged['cagr_10']/100))**10)) - df_merged['cost_2020'])/df_merged['cost_2020']
#Function to calculate roi_equity
df_merged = roi_equity(df_merged)
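Because the purchase cost cancels algebraically, the 10-year equity ROI depends only on the growth rate; a quick numeric check (the actual `roi_equity` helper lives in the custom `kpis` module):

```python
def roi_equity_10yrs(cagr_pct: float) -> float:
    """10-year equity-only ROI: appreciation divided by purchase cost.
    Algebraically (C*(1+g)**10 - C) / C == (1+g)**10 - 1, so cost cancels."""
    g = cagr_pct / 100
    return (1 + g) ** 10 - 1

# At 3.55% annual growth, the property appreciates ~41.7% over 10 years
print(round(roi_equity_10yrs(3.55), 3))
```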
%%render
rentPerNight = 200
AnnualRent_2020 = rentPerNight*0.75*365
df_merged['Annual_rent'] = df_merged['price']*0.75*365
%%render
rentPerNight = 200
AnnualRent_2020 = rentPerNight*0.75*365
CostOfProperty_2020 = 1000000
CAGR_10yrs = 3.550
AnnualRent_10yrs = AnnualRent_2020*10
Appreciation_propertyCost = (CostOfProperty_2020*((1+(CAGR_10yrs/100))**10) - CostOfProperty_2020)
ROI_10yrs = (AnnualRent_2020*10+Appreciation_propertyCost)/CostOfProperty_2020
df_merged = roi(df_merged, 10)
%%render
rentPerNight_2020 = 200
CostOfProperty_2020 = 1000000
CAGR_5yrs = 7.913
CostOfProperty_2020 = (rentPerNight_2020*365*0.75)*(1+(1+CAGR_5yrs/100)**1+(1+CAGR_5yrs/100)**2+(1+CAGR_5yrs/100)**3 )#and so on. We simplify this to the equation below
t = math.log10(1 + (CostOfProperty_2020 * (CAGR_5yrs/100)/(365*0.75*rentPerNight_2020)))/math.log10(1+CAGR_5yrs/100) #t denotes break-even time in years
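The closed-form break-even time comes from summing the geometric series of growing annual rents; a brute-force year-by-year check against the formula, with the same toy numbers as the rendered cell:

```python
import math

def break_even_years(cost: float, nightly_rent: float, cagr_pct: float) -> float:
    """Closed form: solve R*((1+g)**t - 1)/g = cost for t."""
    g = cagr_pct / 100
    annual_rent = nightly_rent * 365 * 0.75
    return math.log10(1 + cost * g / annual_rent) / math.log10(1 + g)

def break_even_brute(cost: float, nightly_rent: float, cagr_pct: float) -> int:
    """Count whole years until cumulative (growing) rent covers the cost."""
    g = cagr_pct / 100
    rent, total, years = nightly_rent * 365 * 0.75, 0.0, 0
    while total < cost:
        total += rent
        rent *= 1 + g
        years += 1
    return years

t = break_even_years(1_000_000, 200, 7.913)
# The closed-form value (~11.7 years) rounds up to the brute-force count
print(round(t, 1), break_even_brute(1_000_000, 200, 7.913))
```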
import numpy as np
df_merged['break_even_years'] = round((np.log10(1+(df_merged['cost_2020']*(df_merged['cagr_5']/100)/(365*0.75*df_merged['price'])))/np.log10(1 + df_merged['cagr_5']/100)),1)
#Calculating Median Break Even Years as per Zipcode
median_break_even = df_merged.groupby(['zipcode'])['break_even_years'].agg('median')
median_break_even = median_break_even.reset_index()
median_break_even.columns = ['zipcode', 'median_breakeven_yrs']
median_break_even.head()
df_merged = pd.merge(df_merged, median_break_even, how = "left", left_on = 'zipcode', right_on = 'zipcode')
#Remove columns with duplicate names
df_merged = df_merged.loc[:,~df_merged.columns.duplicated()]
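The duplicate-column drop works by boolean-masking `df.columns.duplicated()`; a toy illustration:

```python
import pandas as pd

# A frame with a repeated column label, as can happen after merges
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'a'])

# columns.duplicated() flags the second 'a'; ~ keeps only first occurrences
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))  # ['a', 'b']
```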
df_merged.to_csv('all_metrics.csv')
df_ratios = df_merged.groupby(['zipcode'])[['median_breakeven_yrs', 'roi_equity', 'roi_10_yrs', '2020']].agg('median')
df_ratios.head()
df_ratios.to_csv('grouped_metrics.csv')
Users will want to know which room type (private room or entire home/apartment) yields more revenue in each neighborhood. We will first analyze neighborhoods and then dive into zip codes in the remaining plots.
Renting out an entire home/apartment yields more revenue in all four neighborhoods. However, the Staten Island and Queens neighborhoods have only 26 and 16 data points respectively. Also, each neighborhood contains many zip codes, so a zip-code-level comparison is needed to be sure.
Business users can check how renting prices are varying across each neighborhoods.
data = pd.read_csv('all_metrics.csv')
data__1 = data.copy()
fig1 = px.box(data__1, x='room_type', y='price',facet_col="neighbourhood_group_cleansed",width=1600)
fig1.show()
neighborhoods_2 = list(data__1['neighbourhood_group_cleansed'].unique())
zipcodes_2=list(data__1[data__1['neighbourhood_group_cleansed']=='Manhattan']["zipcode"].unique())
origin_3 = widgets.Dropdown(
options = neighborhoods_2,
value='Manhattan',
description='Neighborhood :',
)
origin_4 = widgets.Dropdown(
options = zipcodes_2,
value=zipcodes_2[0],
description='zipcode :',
)
filter_list2 = [i and j for i,j in
zip(data__1['neighbourhood_group_cleansed'] == 'Manhattan',data__1['zipcode'] == 10025)]
trace__1 = go.Box(x= data__1[filter_list2]['room_type'], y=data__1[filter_list2]['price'])
g4 = go.FigureWidget(data=[trace__1],
layout=go.Layout(width=600, height=650, font_size=11,
barmode='group',
hovermode="closest",
title=dict(
text='Rent by neighborhood and roomtype'
)
))
g4.layout.xaxis.title = 'Neighborhood'
g4.layout.yaxis.title = 'Price'
def response2(change):
    if origin_3.value:
        filter_list2 = list(data__1['neighbourhood_group_cleansed'] == origin_3.value)
        temp_df = data__1[filter_list2]
        origin_4.options = list(temp_df["zipcode"].unique())
        if origin_4.value:
            temp_df = temp_df[temp_df["zipcode"] == origin_4.value]
        else:
            origin_4.value = origin_4.options[0]
            temp_df = temp_df[temp_df["zipcode"] == origin_4.value]
        with g4.batch_update():
            g4.data[0].x = temp_df['room_type']
            g4.data[0].y = temp_df["price"]
            g4.layout.xaxis.title = 'room_type'
            g4.layout.yaxis.title = 'Price'
origin_3.observe(response2, names="value")
origin_4.observe(response2, names="value")
container__2 = widgets.HBox([origin_3,origin_4])
widgets.VBox([container__2,g4])
d= pd.read_csv('all_metrics.csv')
d['zipcode'] = d['zipcode'].astype('str')
px.scatter(d, x='zipcode', y ='2020', color='neighbourhood_group_cleansed', labels = {'2020': 'Median cost of Property 2020'}, \
title = 'Median Cost of Property 2020 by Neighborhood and Zipcode',\
)
px.scatter(d, x='zipcode', y ='median_breakeven_yrs', color='neighbourhood_group_cleansed',\
labels = {'median_breakeven_yrs': 'Median break even period (years)'}
, title = 'Median break even years by neighborhood and zipcode')
#Calculating median rent for each neighbourhood
median_rent = ab.groupby(['neighbourhood_group_cleansed'])[['price']].agg('median')
median_rent.columns = ['median_rent']
fig = px.bar(median_rent, x=median_rent.index , y = 'median_rent',\
text = median_rent['median_rent'], title = 'Median rent across Neighborhoods',height = 800, orientation = 'v', color = median_rent.index )
fig
Identify the profitable zip codes based on the metrics calculated. We prefer zip codes with a shorter break-even period and high annual rent or ROI.
Attached separately with each graph.
Users will be able to identify profitable zipcodes based on any of the metrics calculated.
### Annual rent: The rent generated by a property in a year, assuming it is occupied for 75% of the year.
### Break-even period (in years): An estimate of the time required for the cumulative rent generated by a property to equal its purchase cost.
### ROI Equity: An estimate of ROI considering only the appreciation in the value of the property.
### ROI: The most important metric; it takes into account both the annual rent and the appreciation in property value.
dat=pd.read_csv("all_metrics.csv")
dat = dat[dat['roi_10_yrs']<10]
dat['zipcode'] = dat['zipcode'].astype('str')
fig = px.scatter(dat,x="break_even_years",y="Annual_rent",color="zipcode",
color_discrete_sequence=px.colors.qualitative.Alphabet, hover_data = {'zipcode': True,
'neighbourhood_group_cleansed': True},\
title = 'Annual Rent vs Breakeven period')
fig
fig=px.scatter(dat,x="break_even_years",y="roi_equity",color="zipcode",color_discrete_sequence=px.colors.qualitative.Alphabet,\
hover_name="zipcode",hover_data=["Annual_rent","neighbourhood_group_cleansed","break_even_years","roi_equity","cost_2020"])
fig.update_traces(textposition='top right',textfont_size=7)
fig.update_layout(height=600, width=1000)
fig=px.scatter(dat,x="break_even_years",y="roi_10_yrs",color="zipcode",\
color_discrete_sequence=px.colors.qualitative.Alphabet,hover_name="zipcode",\
hover_data=["Annual_rent","break_even_years","roi_10_yrs","cost_2020", 'neighbourhood_group_cleansed'])
fig.update_traces(textposition='top right',textfont_size=7)
fig.update_layout(height=500, width=900)
According to the graph, the top-left quadrant is dominated by zip codes from the Brooklyn neighborhood, which have high ROI (ranging from 1.3 to 2.7) and relatively short break-even periods: 11201, 11217, 11231, 11215.
Zip codes 10036 and 10025 in Manhattan have short break-even periods with moderate ROI.
Zip code 10305 in Staten Island has a short break-even period with high ROI.
fig = make_subplots(
rows=1, cols=3,
subplot_titles=("Annual_rent", "roi_equity", "roi_10_yrs"), )
fig.add_trace(go.Scatter(x=dat["break_even_years"], y=dat["Annual_rent"],\
mode="markers",text=list(dat["zipcode"]),name="Annual_rent"),
row=1, col=1)
fig.add_trace(go.Scatter(x=dat["break_even_years"], y=dat["roi_10_yrs"],\
mode="markers",text=list(dat["zipcode"]),name="roi"),
row=1, col=3)
fig.add_trace(go.Scatter(x=dat["break_even_years"], y=dat["roi_equity"],\
mode="markers",text=list(dat["zipcode"]),name="roi_equity"),
row=1, col=2)
fig.update_layout(height=500, width=1200,
title_text="Side-by-side comparison of metrics vs payback period")
fig.update_traces(textposition='top right',textfont_size=5)
fig.show()
dat=pd.read_csv("all_metrics.csv")
dat = dat[dat['roi_10_yrs']<10]
mapbox_access_token = "pk.eyJ1IjoieWFzaDk1MTYiLCJhIjoiY2tsYXJ5Zm5vMDVjMDJvbWR2cjFwcXd2MyJ9.cxAGwJ720HWnZC0kccSPDw"
year = widgets.IntRangeSlider(
value=[2,47],
min=2.0,
max=47.0,
step=1.0,
description='Break_even_period(years):',
continuous_update=False
)
roi_equity = widgets.FloatRangeSlider(
value=[-0.5,2.0],
min=-0.5,
max=2.0,
step=0.25,
description='roi_equity:',
continuous_update=False
)
roi = widgets.FloatRangeSlider(
value=[0,5],
min=0.0,
max=5.0,
step=0.25,
description='roi:',
continuous_update=False
)
annual = widgets.IntRangeSlider(
value=[0,dat["Annual_rent"].max()+1000],
min=0,
max=dat["Annual_rent"].max()+1000,
step=250,
description='annual_rent:',
continuous_update=False
)
zip_list=dat['zipcode'].unique().tolist()
zip_list.append("All")
container = widgets.HBox(children=[year,roi_equity,roi,annual])
# Zipcode dropdown (defined but not currently wired into the filter callback)
textbox = widgets.Dropdown(
description='zipcodes: ',
value='All',
options=zip_list
)
import plotly.express as px
px.set_mapbox_access_token(mapbox_access_token)
data = [
go.Scattermapbox(
lat=dat["latitude"],
lon=dat["longitude"],
mode='markers',
marker=go.scattermapbox.Marker(
size=5,
),text=list(dat["zipcode"]),
hoverinfo="text"
)
]
g2 = go.FigureWidget(data=data,
layout=
go.Layout(
autosize=True,
hovermode='closest',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
bearing=0,
center=dict(
lat=40.7406,
lon=-73.9985
),
pitch=0,
zoom=9
),height=800,width=1000
))
def response(change):
    # Keep only the rows that fall inside all four slider ranges
    mask = (dat['break_even_years'].between(year.value[0], year.value[1])
            & dat['roi_10_yrs'].between(roi.value[0], roi.value[1])
            & dat['roi_equity'].between(roi_equity.value[0], roi_equity.value[1])
            & dat['Annual_rent'].between(annual.value[0], annual.value[1]))
    temp_df = dat[mask]
    with g2.batch_update():
        g2.data[0].lat = temp_df["latitude"]
        g2.data[0].lon = temp_df['longitude']
        g2.data[0].text = list(temp_df["zipcode"])
roi_equity.observe(response, names="value")
roi.observe(response, names="value")
year.observe(response, names="value")
annual.observe(response,names="value")
widgets.VBox([container,g2])
#Note: The map might throw an error after loading, but the functionality and interactivity of the map are not affected by it.
# The error is due to a bug in the package.
#refer: https://github.com/plotly/plotly.py/issues/2570
import folium
data1=pd.read_csv("all_metrics.csv")
import urllib.request, json
with urllib.request.urlopen("https://raw.githubusercontent.com/fedhere/PUI2015_EC/master/mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson") as url:
data_geo = json.loads(url.read().decode())
county=pd.read_csv('https://query.data.world/s/buotxun7n3q6twzd6havwil7h5nqpb')
county["ZIP"]=county["ZIP"].astype(str)
county.head()
data1=data1.groupby(["zipcode"],as_index=False).mean()[["longitude","latitude","zipcode","break_even_years","cost_2020",\
"Annual_rent","roi_equity","roi_10_yrs"]]
data1['zipcode'] = data1['zipcode'].astype(str)
data1=data1.merge(county,how="left",left_on="zipcode",right_on="ZIP")
tmp=data_geo
geozips = []
for i in range(len(tmp['features'])):
if tmp['features'][i]['properties']['postalCode'] in list(data1['zipcode'].unique()):
geozips.append(tmp['features'][i])
new_json = dict.fromkeys(['type','features'])
new_json['type'] = 'FeatureCollection'
new_json['features'] = geozips
with open("cleaned_geodata.json", "w") as f:
    f.write(json.dumps(new_json, sort_keys=True, indent=4, separators=(',', ': ')))
from folium import FeatureGroup, LayerControl, Map, Marker
king_geo= "cleaned_geodata.json"
from folium.plugins import MarkerCluster
map = folium.Map(location=[40.693943, -73.985880], zoom_start=15, tiles="cartodbpositron")
k=folium.Choropleth(geo_data=king_geo,
data=data1,
columns=['zipcode', 'break_even_years'], # zipcode matches the geojson; break_even_years drives the fill color
key_on='feature.properties.postalCode', # geojson path holding the zip code as str; must match our zipcode column
fill_color='YlOrRd', fill_opacity=0.8, line_opacity=0.8,
legend_name='break_even_years',name="break_even_years").add_to(map)
marker_cluster = MarkerCluster().add_to(k)
for i in range(data1.shape[0]):
location = [data1['latitude'][i],data1['longitude'][i]]
tooltip = "Zipcode:{}<br> COUNTYNAME: {}<br> Click for more".format(data1["zipcode"][i],data1["COUNTYNAME"][i])
folium.Marker(location, # adding more details to the popup screen using HTML
popup="""
<i>Break even years: </i> <br> <b>{}</b> <br>
<i>Annual Rent: </i><b><br>{}</b><br>
<i>Selling price: </i><b><br>{}</b><br>""".format(
round(data1['break_even_years'][i],2),
round(data1['Annual_rent'][i],2),
round(data1['cost_2020'][i],2)),
tooltip=tooltip).add_to(marker_cluster)
n=folium.Choropleth(geo_data=king_geo,
data=data1,
columns=['zipcode', 'roi_equity'], # zipcode matches the geojson; roi_equity drives the fill color for this layer
key_on='feature.properties.postalCode', # geojson path holding the zip code as str; must match our zipcode column
fill_color='PuBuGn', fill_opacity=0.8, line_opacity=0.8,
legend_name='roi_equity',name="roi_equity").add_to(map)
marker_cluster2 = MarkerCluster().add_to(n)
for i in range(data1.shape[0]):
location = [data1['latitude'][i],data1['longitude'][i]]
tooltip = "Zipcode:{}<br> County: {}<br> Click for more".format(data1["zipcode"][i],data1["COUNTYNAME"][i])
folium.Marker(location, # adding more details to the popup screen using HTML
popup="""
<i>County:</i> <br> <b>{}</b> <br>
<i>ROI_10: </i><b><br>{}</b><br>
<i>Breakeven years: </i><b><br>{}</b><br>""".format(
data1['COUNTYNAME'][i],
round(data1['roi_10_yrs'][i],2),
round(data1['break_even_years'][i],2)),
tooltip=tooltip).add_to(marker_cluster2)
LayerControl().add_to(map)
map